Ensemble learning model for identifying the hallmark genes of NFκB/TNF signaling pathway in cancers

Background The nuclear factor kappa B (NFκB) regulatory pathways downstream of tumor necrosis factor (TNF) play a critical role in carcinogenesis. However, the widespread influence of NFκB in cells can result in off-target effects, making it a challenging therapeutic target. Ensemble learning is a machine learning technique where multiple models are combined to improve the performance and robustness of the prediction. Accordingly, an ensemble learning model could uncover more precise targets within the NFκB/TNF signaling pathway for cancer therapy. Methods In this study, we trained an ensemble learning model on the transcriptome profiles from 16 cancer types in the TCGA database to identify a robust set of genes that are consistently associated with the NFκB/TNF pathway in cancer. Our model uses cancer patients as features to predict the genes involved in the NFκB/TNF signaling pathway and can be adapted to predict the genes for different cancer types by switching the cancer type of patients. We also performed functional analysis, survival analysis, and a case study of triple-negative breast cancer to demonstrate our model's potential in translational cancer medicine. Results Our model accurately identified genes regulated by NFκB in response to TNF in cancer patients. The downstream analysis showed that the identified genes are typically involved in the canonical NFκB-regulated pathways, particularly in adaptive immunity, anti-apoptosis, and cellular response to cytokine stimuli. These genes were found to have oncogenic properties and detrimental effects on patient survival. Our model also could distinguish patients with a specific cancer subtype, triple-negative breast cancer (TNBC), which is known to be influenced by NFκB-regulated pathways downstream of TNF. Furthermore, a functional module known as mononuclear cell differentiation was identified that accurately predicts TNBC patients and poor short-term survival in non-TNBC patients, providing a potential avenue for developing precision medicine for cancer subtypes. Conclusions In conclusion, our approach enables the discovery of genes in NFκB-regulated pathways in response to TNF and their relevance to carcinogenesis. We successfully categorized these genes into functional groups, providing valuable insights for discovering more precise and targeted cancer therapeutics. Supplementary Information The online version contains supplementary material available at 10.1186/s12967-023-04355-5.

for feature rescaling to eliminate the analysis error caused by the difference among patients. Since the patients are features, the feature standardization transforms the distribution of each patient's gene expression level with zero mean and unit variance. It was calculated as follows: where and ′ are the original and standardized values of gene expression level, respectively; ̅ is the mean of the patient's gene expression level, and is the standard deviation. The two external mRNA expression profiles of breast cancer (E-GEOD-58135 [2] and E-GEOD-76250 [3]) that were used to validate the capability of the identified functional module in predicting TNBC were downloaded from the ArrayExpress.

Hallmark genes and cancer-dysregulated genes (C6) in MSigDB
During model training, we used the "HALLMARK_TNFA_SIGNALING_VIA_NFKB" hallmark gene set (abbreviated as NFκB/TNF hallmark genes) from MSigDB [4], which covers 200 genes, as the positive data (sample). Out of the 200 genes in this gene set, the TCGA RNA-Seq data covers 198 genes. In addition, there are 4,304 genes covered by other hallmark gene sets in MSigDB. Of these, 157 genes are also included in the NFκB/TNF hallmark gene set, and other hallmark genes cover the remaining 4,147 genes. We excluded these 4,147 genes from the training stages to avoid potential disturbances during model learning, as they may be potentially associated with cancer progression. We then applied the trained ensemble learning model to the complete genome data, including the 4,147 genes that were previously excluded, to predict the potential cancerinfluential genes involved in the NFκB/TNF pathway. To further investigate the oncogenicity of the predicted candidates that are not NFκB/TNF hallmark genes and have at least one vote across all cancer types, we obtained the oncogenic gene set (C6) from MSigDB. The C6 gene set consists of 189 gene sets, totaling 12,493 genes, and encompasses genes that are dysregulated in cancers by affecting known cancer genes.

Functional similarity analysis of the identified candidate genes
We downloaded functional annotation data of genes from the Gene Ontology (GO) database. In this study, we only used the biological processes domain. We used the Jaccard index, calculated from the shared experimentally validated biological processes, to evaluate the functional similarity between each gene in the RNA-Seq data and the six core member genes of the TNF and NFκB family genes. Notably, only the experimentally validated biological processes with a depth smaller than seven in the GO database were used to calculate the Jaccard index similarity. In the GO tree structure, 'depth' is defined as the shortest path length from a GO term to the root term, in this case, the 'biological process' (GO:0008150). To determine if the candidate genes significantly participate in the NFκB-regulated pathways downstream of TNF, we employed gene set enrichment analysis (GSEA). We further excluded genes with zero similarity to the corresponding core member gene to ensure a more representative trend. The similarities between NFκB/TNF core members and all genes were ranked based on the functional similarity score to hit the identified candidate genes. The significance of the enrichment score was assessed by 10,000 permutation tests.

Identification of highly-voted functional modules within the candidates
To interrogate the biological processes in which the candidate genes participate during carcinogenesis, we performed conventional and network-wise functional enrichment analyses combined with GSEA to identify the highly-voted functional modules for each cancer type. In the conventional functional enrichment analysis, the significance was established by the p-value derived from the hypergeometric test, which is described below: where X denotes the evaluated function; N represents the number of GO annotated genes; m indicates the number of candidates; n represents the number of genes with the evaluated function, and k indicates the number of candidates with the evaluated function.
We also incorporated protein interaction and network-wise functional enrichment analyses to enhance the functional relationship between the identified candidates to discover the functional modules formed by candidates [5]. The human protein-protein interaction (PPI) data source was obtained from the InBio Map database [6]. The network-wise functional enrichment analysis was modified from the conventional method. The test function's significance was based on p-values produced from a modified hypergeometric test. The hypergeometric distribution for the network-wise approach is described below: where e is the abbreviation of the PPI. Each symbol has the same meaning as the conventional hypergeometric distribution, but the counting objects are changed from genes to functional PPIs. The functional PPIs are interactions formed by the two genes involved in the same functions. This approach revealed the significant protein interaction functional modules in which the identified candidates were involved. The two p-values from the conventional and network-wise analysis were adjusted by the Benjamini and Hochberg multiple testing procedures to control the false discovery rate (FDR) [7]. Subsequently, the functional modules with adjusted pvalue < 0.05 from both analyses were identified as significantly enriched functional modules. Additionally, only the significantly enriched functional modules with a depth greater than six in the GO database were used to increase the functional specificity.
Next, we applied GSEA to select the significantly candidate-enriched functional modules that also significantly overrepresent the highly voted genes. Herein, we performed GSEA by ranking the candidates based on their votes in the corresponding cancer type to hit the genes in the gene set of the tested functional module to calculate the enrichment score. The significance of the enrichment score, i.e., z-score, was assessed by 1,000 permutation tests. The functional modules with z-score ≧ 2 were denoted as the highly-voted candidate-enriched functional modules. In a word, the final identified functional modules are formed by the highly voted candidates involved in the same functions, as well as forming the PPI among them. The discovered significantly highlyvoted candidate-enriched functional modules were further summarized by the REVIGO [8] algorithm, with a similarity ≥0.9, which was calculated using the Resink algorithm [9], and visualized using the CirGO package [10].

Enrichment analysis of gene sets and patient groups
In this study, we performed GSEA in the functional similarity analysis, identifying highly-voted candidateenriched functional modules, investigating the identified genes' oncogenicity and poor-prognostic characteristics, and assessing the activity of the highly voted functional modules identified in TNBC analysis. In functional similarity analysis, the similarities between NFκB/TNF core members and genes were ranked to hit the identified candidate genes; in the identification of highly-voted candidate-enriched functional modules, the votes of candidates were ranked to hit candidates in the functional modules; in the investigation of identified genes' oncogenicity and poor-prognostic characteristics, the votes of identified genes were ranked to hit cancerdysregulated genes (C6) and poor-prognostic genes, respectively [11]; in assessing the activity of the modules identified in TNBC analysis, the fold-change of gene expression level were ranked to hit the member genes in the module. The fold change of each gene was calculated from the differential expression analysis between TNBC and non-TNBC implemented by the Limma-Voom R package [13]. Patient set enrichment analysis (PSEA) was performed to study the association between patients' contributions (magnitude of coefficients in the model) to the ensemble model and cancer subtypes. That is, we applied the concept of GSEA to test the correlation between patients' average absolute weight across 1,000 member classifiers and TNBC status, as well as the status of the receptors-estrogen receptor (ER), progesterone receptor (PR), human epidermal growth factor receptor 2 (HER2). More specifically, we tested if patients with higher absolute weight tend to be with ER-, PR-, HER2-, or TNBC. In PSEA, the average absolute weight of each patient in the ensemble model was ranked to hit patients with the investigated subtype. All enrichment analyses used descending rank and unweighted enrichment score (ES) for hits. The significance of the enrichment score, i.e., z-score, was assessed by 10 thousand permutation tests.

Survival analysis of the identified genes
To investigate the impact of the identified (voted) genes on patient survival, we collected a pan-cancer survival influential gene set from our previous study [11]. This pan-cancer survival influential gene set is identified by the Cox-regression model [12]. Specifically, that study developed a systematic process to determine the most appropriate cut-off of significance level for identifying survival influential genes across 16 cancer types: BLCA, BRCA, COAD, ESCA, HNSC, KIRC, LIHC, LUAD, LUSC, STAD, CESC, GBM, LGG, OV, PAAD, and SARC.
The first ten cancer types are also in our ensemble model and referred to as the internal data set. In contrast, the remaining 6 cancers were not used during the model training and designated as the independent data set. By using two separate data sets, we can ensure that the model does not overfit the training data and can accurately predict patient survival for all cancers. Among the 4,678 identified genes in the pan-cancer model, 4,022 are provided with a pre-calculated hazard ratio (the exponential regression coefficient) in the Cox regression model from our previous study. We then calculated patients' risk scores to determine the impact of the identified genes on patient outcomes. For each patient, the risk score is defined below: where n is the number of identified genes, represents the standardized expression level of gene i, and denotes the coefficient of gene i calculated from the Cox regression model in our previous study [11]. Subsequently, we stratified patients into two groups, namely high-risk and low-risk, using the median risk score as the cut-off. The prognostic effect and significance of the identified genes were evaluated using Kaplan-Meier plots and the logrank test.

Classification of breast cancer subtypes
We classified the breast cancer patients into subtypes according to their receptor (ER and HER2) status. First, the patients who are recorded to have three receptors (ER, PR, and HER2) all negative are classified as TNBC type; the other patients are denoted as non-TNBC type, and further classified as types of luminal, LumA, LumB, HER2+, and unclassified. The patients who express ER were first classified as luminal type. The patients of luminal type were then subclassified into LumA (not express HER2) and LumB (express HER2). The patients who do not express ER but express HER2 were classified as HER2+. Finally, the patients who did not belong to the above subtypes (TNBC, luminal, LumA, LumB, HER2+) were categorized as unclassified type.

Prediction model in classifying patients' TNBC status
To test the performance of the identified functional modules in predicting TNBC patients, we trained a logistic regression model for each module using gene expression profiles in the corresponding module. For each module, we performed 100 hold-out processes to assess the potential overfitting. During the hold-out process, 60% of the sampling data was used for training, and the remaining 40% was used for testing. The performance was then evaluated by AUC and AUCPR values derived from the testing data. On the other hand, we merged three datasets, which are TCGA, E-GEOD-58135 [2], and E-GEOD-76250 [3], to train a global model (logistic regression model) to estimate the contribution of each gene in the functional module of mononuclear cell differentiation in predicting TNBC patients. The contribution of each gene was then evaluated by the effect size, which is the z-score estimated by Wald's test, in the model.

The conventional approach for identification of the genes of interest
To demonstrate the performance of our ensemble learning model, we performed the conventional approach that used the Pearson correlation coefficient (PCC) of gene expression to identify the genes highly associated with the 198 NFκB/TNF hallmark genes in carcinogenesis. We first calculated the PCC between each gene and the 198 NFκB/TNF hallmark genes for each cancer type. Then, for each gene, the median of the absolute values of these 198 PCCs was used as its association score with the 198 NFκB/TNF hallmark genes in the corresponding cancer type. In pan-cancer analysis, the mean value of the association score was adopted. We calculated the AUC derived from the association score to predict the NFκB/TNF hallmark genes to evaluate the prediction accuracy. For the enrichment analysis in oncogenicity and functional similarity with NFκB/TNF core members, the top n genes with high association scores were denoted as the conventional approach's candidates. The value of n is the number of voted genes identified by our ensemble learning approach in the corresponding cancer type or pan-cancer analysis. Then, the same procedure of enrichment analysis in oncogenicity and functional similarity with NFκB/TNF core members was applied to the candidates identified by the conventional approach.       Each dot on the curve represents the proportion of cancer-dysregulated genes in the category with an association score no less than the threshold indicated on the x-axis. The association score, the label of the   The figure illustrates Spearman's ρ correlation coefficient of gene votes between ensemble models with 500 and 1,000 member classifiers for each cancer type. Genes with zero votes in both models were excluded from the calculation of Spearman's ρ. The obtained ρ values range from 0.97 to 0.99 across 16 cancers, indicating the high stability of our ensemble model. It also proves that running 1,000 times is sufficient to capture the overall information for the different cancer types.

Figure S10: Enrichment analysis of LUAD subtypes based on smoking history
PSEA was conducted to examine different clinical statuses with LUAD subtypes based on the patient's smoking history. The LUAD subtypes were classified based on smoking history. The subtypes include (a) lifelong nonsmokers (less than 100 cigarettes smoked in a lifetime); (b) smokers, comprising current smokers (including daily smokers and nondaily smokers or occasional smokers), current reformed smokers for more than 15 years, current reformed smokers for 15 years or less, and current reformed smokers, duration not specified. n represents the number of patients with the corresponding smoker type in TCGA, whereas "hit" represents the subset of these patients for whom RNA-Seq data is available.

GOID
Description N E Adj_Npva Adj_Epva